%matplotlib inline
import os
import numpy as np
import calendar
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pooch # download data / avoid re-downloading
from IPython import get_ipython
sns.set_palette("colorblind")
pd.options.display.max_rows = 8
Pandas
Disclaimer: this course is adapted from the Pandas tutorial by Joris Van den Bossche. R users might also want to read Pandas: Comparison with R / R libraries for a smooth start with Pandas.
First case study: Titanic dataset
First, it is important to fetch remote files automatically, for reproducibility (and to avoid typing file names manually):
url = "http://josephsalmon.eu/enseignement/datasets/titanic.csv"
path_target = "./titanic.csv"
path, fname = os.path.split(path_target)
pooch.retrieve(url, path=path, fname=fname, known_hash=None)  # if needed `pip install pooch`
'/home/bcharlier/enseignement/HAX712/HAX712X/Courses/Pandas/titanic.csv'
Reading the file as a pandas dataframe:
df_titanic_raw = pd.read_csv("titanic.csv")
Visualize the end of the dataset:
df_titanic_raw.tail(n=3)
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.00 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |
Visualize the beginning of the dataset:
df_titanic_raw.head(n=5)
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Missing values
It is common to encounter features/covariates with missing values. In pandas they are mostly handled as np.nan (not a number). In the future, they will be treated as NA (Not Available), similarly to R; see the Pandas documentation on missing data for standard behavior and details.
Note that the main difference between pd.NA and np.nan is that pd.NA propagates even for comparisons:
pd.NA == 1
<NA>
whereas
np.nan == 1
False
Testing the presence of missing values
pd.isna(pd.NA)
pd.isna(np.nan)
True
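Before dropping anything, it helps to count the missing values per column. A minimal sketch on a made-up toy frame (not the Titanic file itself):

```python
import numpy as np
import pandas as pd

# Toy frame with missing entries (hypothetical data mimicking Age/Cabin)
df = pd.DataFrame({"Age": [22.0, np.nan, 26.0],
                   "Cabin": [None, "C85", None]})

# isna() returns a boolean mask; summing it counts missing values per column
n_missing = df.isna().sum()
print(n_missing["Age"])    # 1
print(n_missing["Cabin"])  # 2
```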
The simplest strategy (when you can afford it / when you have enough samples) consists of removing all rows containing nans/NAs.
df_titanic = df_titanic_raw.dropna()
df_titanic.tail(3)
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 879 | 880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
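Note that dropna() with no arguments is aggressive: a single missing cell removes the whole row. A gentler sketch, on a hypothetical toy frame, restricts the check to the columns you actually need via subset=:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Cabin is mostly missing, like in the Titanic file
df = pd.DataFrame({"Age": [22.0, np.nan, 26.0],
                   "Cabin": [None, "C85", None]})

# Drop rows only when `Age` is missing; a missing `Cabin` is tolerated
df_age = df.dropna(subset=["Age"])
print(len(df_age))       # 2
print(len(df.dropna()))  # 0: here every row has some missing value
```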
# Useful info on the dataset (especially missing values!)
df_titanic.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 183 entries, 1 to 889
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 183 non-null int64
1 Survived 183 non-null int64
2 Pclass 183 non-null int64
3 Name 183 non-null object
4 Sex 183 non-null object
5 Age 183 non-null float64
6 SibSp 183 non-null int64
7 Parch 183 non-null int64
8 Ticket 183 non-null object
9 Fare 183 non-null float64
10 Cabin 183 non-null object
11 Embarked 183 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 18.6+ KB
# Check that the `Cabin` information is mostly missing; the same holds for `Age`
df_titanic_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Description of the titanic.csv dataset
Details of the dataset are given here
- Survived: Survival (0 = No, 1 = Yes)
- Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Sex: Sex (male/female)
- Age: Age in years
- SibSp: # of siblings/spouses aboard the Titanic
- Parch: # of parents/children aboard the Titanic
- Ticket: Ticket number
- Fare: Passenger fare
- Cabin: Cabin number
- Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
- Name: Name of the passenger
- PassengerId: Number identifying each passenger
For those interested, an extended version of the dataset is available here https://biostat.app.vumc.org/wiki/pub/Main/DataSets/titanic.txt.
Simple descriptive statistics
df_titanic.describe()
|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 183.000000 | 183.000000 | 183.000000 | 183.000000 | 183.000000 | 183.000000 | 183.000000 |
| mean | 455.366120 | 0.672131 | 1.191257 | 35.674426 | 0.464481 | 0.475410 | 78.682469 |
| std | 247.052476 | 0.470725 | 0.515187 | 15.643866 | 0.644159 | 0.754617 | 76.347843 |
| min | 2.000000 | 0.000000 | 1.000000 | 0.920000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 263.500000 | 0.000000 | 1.000000 | 24.000000 | 0.000000 | 0.000000 | 29.700000 |
| 50% | 457.000000 | 1.000000 | 1.000000 | 36.000000 | 0.000000 | 0.000000 | 57.000000 |
| 75% | 676.000000 | 1.000000 | 1.000000 | 47.500000 | 1.000000 | 1.000000 | 90.000000 |
| max | 890.000000 | 1.000000 | 3.000000 | 80.000000 | 3.000000 | 4.000000 | 512.329200 |
Visualization
- Histograms (please avoid…often useless)
fig, ax = plt.subplots(1, 1, figsize=(5, 5))
ax.hist(df_titanic['Age'], density=True, bins=25)
plt.xlabel('Age')
plt.ylabel('Proportion')
plt.title("Passenger age histogram")
plt.show()
- Kernel Density Estimate (KDE):
fig, ax = plt.subplots(1, 1, figsize=(5, 5))
sns.kdeplot(
df_titanic["Age"], ax=ax, fill=True, cut=0, bw_adjust=0.1
)
plt.xlabel("Age")
plt.ylabel("Density")
plt.title("Passenger age kernel density estimate")
plt.tight_layout()
plt.show()
Note: the bandwidth parameter (here encoded as bw_adjust) controls the smoothing level. It is a common parameter in kernel smoothing, investigated thoroughly in the non-parametric statistics literature.
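Under the hood, this is what scipy.stats.gaussian_kde computes. A small sketch on synthetic data (the sample and the evaluation grid here are made up, not the Titanic ages) shows how the bandwidth parameter rescales the smoothing:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=35, scale=12, size=200)  # synthetic "ages"
xx = np.linspace(0, 80, 200)                     # evaluation grid

# Small bw_method -> wiggly estimate; large bw_method -> over-smoothed
density_rough = stats.gaussian_kde(sample, bw_method=0.1)(xx)
density_smooth = stats.gaussian_kde(sample, bw_method=1.0)(xx)

# Both are densities evaluated on the grid (non-negative everywhere)
print(density_rough.min() >= 0 and density_smooth.min() >= 0)  # True
```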
- Swarmplot:
fig, ax = plt.subplots(1, 1, figsize=(5, 5))
sns.swarmplot(
data=df_titanic_raw,
ax=ax,
x="Sex",
y="Age",
hue="Survived",
palette={0: "r", 1: "k"},
order=["female", "male"],
)
plt.title("Passenger age by gender/survival")
plt.legend(labels=["Died", "Survived"], loc="upper left")
plt.tight_layout()
plt.show()
/home/bcharlier/.conda/envs/hax712/lib/python3.9/site-packages/seaborn/categorical.py:3544: UserWarning:
6.8% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.

Interactivity
Interacting with code and output is nowadays getting easier and easier (see also Shiny apps in the R world). In Python, one can use plotly, or widgets together with the interact function from ipywidgets, for this purpose. We are going to illustrate that on the simple KDE and histogram examples.
import plotly.graph_objects as go
import numpy as np
from scipy import stats
# Create figure
fig = go.FigureWidget()
# Add traces, one for each slider step
bws = np.arange(0.01, 0.51, 0.01)
xx = np.arange(0, 100, 0.5)
for step in bws:
    kernel = stats.gaussian_kde(df_titanic['Age'], bw_method=step)
    yy = kernel(xx)
    fig.add_trace(
        go.Scatter(
            visible=False,
            fill='tozeroy',
            fillcolor='rgba(67, 67, 67, 0.5)',
            line=dict(color="black", width=2),
            name=f"Bw = {step:.2f}",
            x=xx,
            y=yy))
# Make 10th trace visible
fig.data[10].visible = True
# Create and add slider
steps = []
for i in range(len(fig.data)):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(fig.data)},
              # {"title": f"{fig.data[i].name}"}
              ],
        label=f"{bws[i]:.2f}"
    )
    step["args"][0]["visible"][i] = True  # Toggle the i-th trace to "visible"
    steps.append(step)
sliders = [dict(
    active=10,
    currentvalue={"prefix": "Bandwidth = "},
    pad={"t": 50},
    steps=steps
)]
fig.update_layout(
    sliders=sliders,
    template="simple_white",
    title="Impact of the bandwidth on the KDE",
    showlegend=False,
)
fig.show()
import plotly.graph_objects as go
import numpy as np
from scipy import stats
# Create figure
fig = go.FigureWidget()
# Add traces, one for each slider step
bws = np.arange(0.01, 5.51, 0.3)
xx = np.arange(0, 100, 0.5)
for step in bws:
    fig.add_trace(
        go.Histogram(
            x=df_titanic['Age'],
            xbins=dict(  # bins used for histogram
                start=xx.min(),
                end=xx.max(),
                size=step
            ),
            autobinx=False,
            visible=False,
            marker_color="rgba(67, 67, 67, 0.5)",
            name=f"Bandwidth = {step:.2f}"
        ))
# Make 10th trace visible
fig.data[10].visible = True
# Create and add slider
steps = []
for i in range(len(fig.data)):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(fig.data)},
              # {"title": f"{fig.data[i].name}"}
              ],
        label=f"{bws[i]:.2f}"
    )
    step["args"][0]["visible"][i] = True  # Toggle the i-th trace to "visible"
    steps.append(step)
sliders = [dict(
    active=10,
    currentvalue={"prefix": "Bandwidth = "},
    pad={"t": 50},
    steps=steps
)]
fig.update_layout(
    sliders=sliders,
    template="simple_white",
    title="Impact of the bandwidth on histograms",
    showlegend=False,
)
fig.show()
IPython widgets (in Jupyter notebooks)
import ipywidgets as widgets
from ipywidgets import interact, fixed
def hist_explore(
    dataset=df_titanic,
    variable=df_titanic.columns,
    n_bins=24,
    alpha=0.25,
    density=False,
):
    fig, ax = plt.subplots(1, 1, figsize=(5, 5))
    ax.hist(
        dataset[variable], density=density, bins=n_bins, alpha=alpha
    )  # standardization
    plt.ylabel("Density level")
    plt.title(f"Dataset {dataset.attrs['name']}:\n Histogram for passengers' age")
    plt.tight_layout()
    plt.show()
interact(
    hist_explore,
    dataset=fixed(df_titanic),
    n_bins=(1, 50, 1),
    alpha=(0, 1, 0.1),
    density=False,
)
def kde_explore(dataset=df_titanic, variable=df_titanic.columns, bw=5):
    fig, ax = plt.subplots(1, 1, figsize=(5, 5))
    sns.kdeplot(dataset[variable], bw_adjust=bw, fill=True, cut=0, ax=ax)
    plt.ylabel("Density level")
    plt.title(f"Dataset:\n KDE for passengers' {variable}")
    plt.tight_layout()
    plt.show()
interact(kde_explore, dataset=fixed(df_titanic), bw=(0.001, 2, 0.01))
Groupby function
How does the survival rate change w.r.t. sex?
df_titanic_raw.groupby('Sex')[['Survived']].aggregate(lambda x: x.mean())
|   | Survived |
|---|---|
| Sex | |
| female | 0.742038 |
| male | 0.188908 |
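The lambda above can be avoided: .agg accepts function names, and passing several at once returns the mean together with the group sizes. A sketch on a made-up miniature of the data:

```python
import pandas as pd

# Hypothetical mini-version of the Titanic columns used above
df = pd.DataFrame({"Sex": ["female", "female", "male", "male", "male"],
                   "Survived": [1, 1, 0, 1, 0]})

# Several aggregations in one pass, no lambda needed
summary = df.groupby("Sex")["Survived"].agg(["mean", "count"])
print(summary.loc["female", "mean"])     # 1.0
print(int(summary.loc["male", "count"]))  # 3
```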
How does the survival rate change w.r.t. the class?
df_titanic.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
fig, ax = plt.subplots(1, 1, figsize=(5, 5))
df_titanic.groupby('Pclass')['Survived'].aggregate(
    lambda x: x.mean()).plot(ax=ax, kind='bar')
plt.xlabel('Class')
plt.ylabel('Survival rate')
plt.title('Survival rate by class')
plt.show()
catplot: a visual groupby
ax = sns.catplot(
    x="Pclass",
    y="Age",
    hue="Sex",
    palette={'female': 'red', 'male': 'b'},
    data=df_titanic_raw,
    jitter=0.2,  # a float, not a string
    s=8,
)
sns.move_legend(ax, "upper left", bbox_to_anchor=(0.8, 0.8))
plt.show()
ax = sns.catplot(
    x="Pclass",
    y="Age",
    hue="Sex",
    palette={'female': 'red', 'male': 'b'},
    alpha=0.8,
    data=df_titanic_raw,
    kind='swarm',
    s=11,
)
sns.move_legend(ax, "upper left", bbox_to_anchor=(0.8, 0.8))
plt.show()
ax = sns.catplot(
    x="Sex",
    y="Age",
    hue="Sex",
    palette={'female': 'red', 'male': 'b'},
    col='Pclass',
    alpha=0.8,
    data=df_titanic_raw,
    kind='swarm',
    s=6,
    height=5,
    aspect=0.35,
)
plt.show()
ax = sns.catplot(
    x='Pclass',
    y='Age',
    hue="Sex",
    palette={'female': 'red', 'male': 'b'},
    data=df_titanic_raw,
    kind="violin",
    alpha=0.8,
)
sns.move_legend(ax, "upper left", bbox_to_anchor=(0.8, 0.8))
plt.show()
Beware: large difference in sex ratio by class
df_titanic_raw.groupby(['Sex', 'Pclass'])[['Sex']].count()
df_titanic_raw.groupby(['Sex'])[['Sex']].count()
|   | Sex |
|---|---|
| Sex | |
| female | 314 |
| male | 577 |
Consider checking the raw data: boxplots or simple summary statistics are often not enough to understand the diversity inside the data; see for instance the discussion here: https://sigmoid.social/@ct_bergstrom@fediscience.org/110267907203021857
References:
- Practical Business Python, Comprehensive Guide to Grouping and Aggregating with Pandas, by Chris Moffitt
pd.crosstab
pd.crosstab(
    df_titanic_raw["Sex"],
    df_titanic_raw["Pclass"],
    values=df_titanic_raw["Sex"],
    aggfunc="count",
    normalize=False,
)
| Pclass | 1 | 2 | 3 |
|---|---|---|---|
| Sex | |||
| female | 94 | 76 | 144 |
| male | 122 | 108 | 347 |
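The same table can be turned into proportions with the normalize argument; normalize="index" gives row proportions. A sketch on made-up series (not the real passenger counts):

```python
import pandas as pd

# Hypothetical toy data mirroring the Sex/Pclass columns
sex = pd.Series(["female", "male", "male", "female", "male"], name="Sex")
pclass = pd.Series([1, 1, 3, 3, 3], name="Pclass")

# Row-normalized counts: each row of the table sums to 1
tab = pd.crosstab(sex, pclass, normalize="index")
print(tab.loc["female", 1])  # 0.5: half of the women travel in 1st class
```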
df_titanic
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 872 | 873 | 0 | 1 | Carlsson, Mr. Frans Olof | male | 33.0 | 0 | 0 | 695 | 5.0000 | B51 B53 B55 | S |
| 879 | 880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
183 rows × 12 columns
df_titanic.index
Int64Index([ 1, 3, 6, 10, 11, 21, 23, 27, 52, 54,
...
835, 853, 857, 862, 867, 871, 872, 879, 887, 889],
dtype='int64', length=183)
df_titanic.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
pd.options.display.max_rows = 12
df_titanic.dtypes
df_titanic['Name'].astype(str)
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
6 McCarthy, Mr. Timothy J
10 Sandstrom, Miss. Marguerite Rut
11 Bonnell, Miss. Elizabeth
...
871 Beckwith, Mrs. Richard Leonard (Sallie Monypeny)
872 Carlsson, Mr. Frans Olof
879 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)
887 Graham, Miss. Margaret Edith
889 Behr, Mr. Karl Howell
Name: Name, Length: 183, dtype: object
Extract numpy arrays from dataframes
This is useful for feeding packages built on top of pandas (e.g., sklearn, though nowadays it works out of the box with pandas dataframes).
array_titanic = df_titanic.values # associated numpy array
array_titanic
array([[2, 1, 1, ..., 71.2833, 'C85', 'C'],
[4, 1, 1, ..., 53.1, 'C123', 'S'],
[7, 0, 1, ..., 51.8625, 'E46', 'S'],
...,
[880, 1, 1, ..., 83.1583, 'C50', 'C'],
[888, 1, 1, ..., 30.0, 'B42', 'S'],
[890, 1, 1, ..., 30.0, 'C148', 'C']], dtype=object)
1D dataset: Series (a column of a DataFrame)
A Series is a labeled one-dimensional column of a single data type.
fare = df_titanic['Fare']
fare
1      71.2833
3 53.1000
6 51.8625
10 16.7000
11 26.5500
...
871 52.5542
872 5.0000
879 83.1583
887 30.0000
889 30.0000
Name: Fare, Length: 183, dtype: float64
Series attributes: index and values
fare.values[:10]
array([ 71.2833, 53.1 , 51.8625, 16.7 , 26.55 , 13. ,
35.5 , 263. , 76.7292, 61.9792])
Contrary to numpy arrays, you can index with formats other than integers:
# Be careful, what follows changes the indexing
df_titanic_raw = df_titanic_raw.set_index('Name')
df_titanic_raw
|   | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Name | |||||||||||
| Braund, Mr. Owen Harris | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| Heikkinen, Miss. Laina | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| Futrelle, Mrs. Jacques Heath (Lily May Peel) | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| Allen, Mr. William Henry | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Montvila, Rev. Juozas | 887 | 0 | 2 | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| Graham, Miss. Margaret Edith | 888 | 1 | 1 | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| Johnston, Miss. Catherine Helen "Carrie" | 889 | 0 | 3 | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| Behr, Mr. Karl Howell | 890 | 1 | 1 | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| Dooley, Mr. Patrick | 891 | 0 | 3 | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 11 columns
age = df_titanic_raw['Age']
age['Behr, Mr. Karl Howell']
26.0
age.mean()
29.69911764705882
df_titanic_raw[age < 2]
|   | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Name | |||||||||||
| Caldwell, Master. Alden Gates | 79 | 1 | 2 | male | 0.83 | 0 | 2 | 248738 | 29.0000 | NaN | S |
| Panula, Master. Eino Viljami | 165 | 0 | 3 | male | 1.00 | 4 | 1 | 3101295 | 39.6875 | NaN | S |
| Johnson, Miss. Eleanor Ileen | 173 | 1 | 3 | female | 1.00 | 1 | 1 | 347742 | 11.1333 | NaN | S |
| Becker, Master. Richard F | 184 | 1 | 2 | male | 1.00 | 2 | 1 | 230136 | 39.0000 | F4 | S |
| Allison, Master. Hudson Trevor | 306 | 1 | 1 | male | 0.92 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Hamalainen, Master. Viljo | 756 | 1 | 2 | male | 0.67 | 1 | 1 | 250649 | 14.5000 | NaN | S |
| Dean, Master. Bertram Vere | 789 | 1 | 3 | male | 1.00 | 1 | 2 | C.A. 2315 | 20.5750 | NaN | S |
| Thomas, Master. Assad Alexander | 804 | 1 | 3 | male | 0.42 | 0 | 1 | 2625 | 8.5167 | NaN | C |
| Mallet, Master. Andre | 828 | 1 | 2 | male | 1.00 | 0 | 2 | S.C./PARIS 2079 | 37.0042 | NaN | C |
| Richards, Master. George Sibley | 832 | 1 | 2 | male | 0.83 | 1 | 1 | 29106 | 18.7500 | NaN | S |
14 rows × 11 columns
# You can come back to the original indexing
df_titanic_raw = df_titanic_raw.reset_index()
Counting values for categorical variables
df_titanic_raw['Embarked'].value_counts(normalize=False, sort=True,
                                        ascending=False)
S    644
C 168
Q 77
Name: Embarked, dtype: int64
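With normalize=True, value_counts returns proportions instead of raw counts. A minimal sketch on a made-up series (not the real Embarked column):

```python
import pandas as pd

# Hypothetical toy port-of-embarkation data
embarked = pd.Series(["S", "C", "S", "Q", "S", "C"])

# Proportions rather than counts, still sorted in decreasing order
props = embarked.value_counts(normalize=True)
print(props["S"])  # 0.5
```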
pd.options.display.max_rows = 70
df_titanic[df_titanic['Embarked'] == 'C']
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 52 | 53 | 1 | 1 | Harper, Mrs. Henry Sleeper (Myna Haxtun) | female | 49.0 | 1 | 0 | PC 17572 | 76.7292 | D33 | C |
| 54 | 55 | 0 | 1 | Ostby, Mr. Engelhart Cornelius | male | 65.0 | 0 | 1 | 113509 | 61.9792 | B30 | C |
| 96 | 97 | 0 | 1 | Goldschmidt, Mr. George B | male | 71.0 | 0 | 0 | PC 17754 | 34.6542 | A5 | C |
| 97 | 98 | 1 | 1 | Greenfield, Mr. William Bertram | male | 23.0 | 0 | 1 | PC 17759 | 63.3583 | D10 D12 | C |
| 118 | 119 | 0 | 1 | Baxter, Mr. Quigg Edmond | male | 24.0 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C |
| 139 | 140 | 0 | 1 | Giglio, Mr. Victor | male | 24.0 | 0 | 0 | PC 17593 | 79.2000 | B86 | C |
| 174 | 175 | 0 | 1 | Smith, Mr. James Clinch | male | 56.0 | 0 | 0 | 17764 | 30.6958 | A7 | C |
| 177 | 178 | 0 | 1 | Isham, Miss. Ann Elizabeth | female | 50.0 | 0 | 0 | PC 17595 | 28.7125 | C49 | C |
| 194 | 195 | 1 | 1 | Brown, Mrs. James Joseph (Margaret Tobin) | female | 44.0 | 0 | 0 | PC 17610 | 27.7208 | B4 | C |
| 195 | 196 | 1 | 1 | Lurette, Miss. Elise | female | 58.0 | 0 | 0 | PC 17569 | 146.5208 | B80 | C |
| 209 | 210 | 1 | 1 | Blank, Mr. Henry | male | 40.0 | 0 | 0 | 112277 | 31.0000 | A31 | C |
| 215 | 216 | 1 | 1 | Newell, Miss. Madeleine | female | 31.0 | 1 | 0 | 35273 | 113.2750 | D36 | C |
| 218 | 219 | 1 | 1 | Bazzani, Miss. Albina | female | 32.0 | 0 | 0 | 11813 | 76.2917 | D15 | C |
| 273 | 274 | 0 | 1 | Natsch, Mr. Charles H | male | 37.0 | 0 | 1 | PC 17596 | 29.7000 | C118 | C |
| 291 | 292 | 1 | 1 | Bishop, Mrs. Dickinson H (Helen Walton) | female | 19.0 | 1 | 0 | 11967 | 91.0792 | B49 | C |
| 292 | 293 | 0 | 2 | Levy, Mr. Rene Jacques | male | 36.0 | 0 | 0 | SC/Paris 2163 | 12.8750 | D | C |
| 299 | 300 | 1 | 1 | Baxter, Mrs. James (Helene DeLaudeniere Chaput) | female | 50.0 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C |
| 307 | 308 | 1 | 1 | Penasco y Castellana, Mrs. Victor de Satode (M... | female | 17.0 | 1 | 0 | PC 17758 | 108.9000 | C65 | C |
| 309 | 310 | 1 | 1 | Francatelli, Miss. Laura Mabel | female | 30.0 | 0 | 0 | PC 17485 | 56.9292 | E36 | C |
| 310 | 311 | 1 | 1 | Hays, Miss. Margaret Bechstein | female | 24.0 | 0 | 0 | 11767 | 83.1583 | C54 | C |
| 311 | 312 | 1 | 1 | Ryerson, Miss. Emily Borie | female | 18.0 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C |
| 319 | 320 | 1 | 1 | Spedden, Mrs. Frederic Oakley (Margaretta Corn... | female | 40.0 | 1 | 1 | 16966 | 134.5000 | E34 | C |
| 325 | 326 | 1 | 1 | Young, Miss. Marie Grice | female | 36.0 | 0 | 0 | PC 17760 | 135.6333 | C32 | C |
| 329 | 330 | 1 | 1 | Hippach, Miss. Jean Gertrude | female | 16.0 | 0 | 1 | 111361 | 57.9792 | B18 | C |
| 337 | 338 | 1 | 1 | Burns, Miss. Elizabeth Margaret | female | 41.0 | 0 | 0 | 16966 | 134.5000 | E40 | C |
| 366 | 367 | 1 | 1 | Warren, Mrs. Frank Manley (Anna Sophia Atkinson) | female | 60.0 | 1 | 0 | 110813 | 75.2500 | D37 | C |
| 369 | 370 | 1 | 1 | Aubart, Mme. Leontine Pauline | female | 24.0 | 0 | 0 | PC 17477 | 69.3000 | B35 | C |
| 370 | 371 | 1 | 1 | Harder, Mr. George Achilles | male | 25.0 | 1 | 0 | 11765 | 55.4417 | E50 | C |
| 377 | 378 | 0 | 1 | Widener, Mr. Harry Elkins | male | 27.0 | 0 | 2 | 113503 | 211.5000 | C82 | C |
| 393 | 394 | 1 | 1 | Newell, Miss. Marjorie | female | 23.0 | 1 | 0 | 35273 | 113.2750 | D36 | C |
| 452 | 453 | 0 | 1 | Foreman, Mr. Benjamin Laventall | male | 30.0 | 0 | 0 | 113051 | 27.7500 | C111 | C |
| 453 | 454 | 1 | 1 | Goldenberg, Mr. Samuel L | male | 49.0 | 1 | 0 | 17453 | 89.1042 | C92 | C |
| 473 | 474 | 1 | 2 | Jerwan, Mrs. Amin S (Marie Marthe Thuillard) | female | 23.0 | 0 | 0 | SC/AH Basle 541 | 13.7917 | D | C |
| 484 | 485 | 1 | 1 | Bishop, Mr. Dickinson H | male | 25.0 | 1 | 0 | 11967 | 91.0792 | B49 | C |
| 487 | 488 | 0 | 1 | Kent, Mr. Edward Austin | male | 58.0 | 0 | 0 | 11771 | 29.7000 | B37 | C |
| 496 | 497 | 1 | 1 | Eustis, Miss. Elizabeth Mussey | female | 54.0 | 1 | 0 | 36947 | 78.2667 | D20 | C |
| 505 | 506 | 0 | 1 | Penasco y Castellana, Mr. Victor de Satode | male | 18.0 | 1 | 0 | PC 17758 | 108.9000 | C65 | C |
| 523 | 524 | 1 | 1 | Hippach, Mrs. Louis Albert (Ida Sophia Fischer) | female | 44.0 | 0 | 1 | 111361 | 57.9792 | B18 | C |
| 539 | 540 | 1 | 1 | Frolicher, Miss. Hedwig Margaritha | female | 22.0 | 0 | 2 | 13568 | 49.5000 | B39 | C |
| 544 | 545 | 0 | 1 | Douglas, Mr. Walter Donald | male | 50.0 | 1 | 0 | PC 17761 | 106.4250 | C86 | C |
| 550 | 551 | 1 | 1 | Thayer, Mr. John Borland Jr | male | 17.0 | 0 | 2 | 17421 | 110.8833 | C70 | C |
| 556 | 557 | 1 | 1 | Duff Gordon, Lady. (Lucille Christiana Sutherl... | female | 48.0 | 1 | 0 | 11755 | 39.6000 | A16 | C |
| 581 | 582 | 1 | 1 | Thayer, Mrs. John Borland (Marian Longstreth M... | female | 39.0 | 1 | 1 | 17421 | 110.8833 | C68 | C |
| 583 | 584 | 0 | 1 | Ross, Mr. John Hugo | male | 36.0 | 0 | 0 | 13049 | 40.1250 | A10 | C |
| 587 | 588 | 1 | 1 | Frolicher-Stehli, Mr. Maxmillian | male | 60.0 | 1 | 1 | 13567 | 79.2000 | B41 | C |
| 591 | 592 | 1 | 1 | Stephenson, Mrs. Walter Bertram (Martha Eustis) | female | 52.0 | 1 | 0 | 36947 | 78.2667 | D20 | C |
| 599 | 600 | 1 | 1 | Duff Gordon, Sir. Cosmo Edmund ("Mr Morgan") | male | 49.0 | 1 | 0 | PC 17485 | 56.9292 | A20 | C |
| 632 | 633 | 1 | 1 | Stahelin-Maeglin, Dr. Max | male | 32.0 | 0 | 0 | 13214 | 30.5000 | B50 | C |
| 641 | 642 | 1 | 1 | Sagesser, Mlle. Emma | female | 24.0 | 0 | 0 | PC 17477 | 69.3000 | B35 | C |
| 645 | 646 | 1 | 1 | Harper, Mr. Henry Sleeper | male | 48.0 | 1 | 0 | PC 17572 | 76.7292 | D33 | C |
| 647 | 648 | 1 | 1 | Simonius-Blumer, Col. Oberst Alfons | male | 56.0 | 0 | 0 | 13213 | 35.5000 | A26 | C |
| 659 | 660 | 0 | 1 | Newell, Mr. Arthur Webster | male | 58.0 | 0 | 2 | 35273 | 113.2750 | D48 | C |
| 679 | 680 | 1 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36.0 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C |
| 681 | 682 | 1 | 1 | Hassab, Mr. Hammad | male | 27.0 | 0 | 0 | PC 17572 | 76.7292 | D49 | C |
| 698 | 699 | 0 | 1 | Thayer, Mr. John Borland | male | 49.0 | 1 | 1 | 17421 | 110.8833 | C68 | C |
| 700 | 701 | 1 | 1 | Astor, Mrs. John Jacob (Madeleine Talmadge Force) | female | 18.0 | 1 | 0 | PC 17757 | 227.5250 | C62 C64 | C |
| 710 | 711 | 1 | 1 | Mayne, Mlle. Berthe Antonine ("Mrs de Villiers") | female | 24.0 | 0 | 0 | PC 17482 | 49.5042 | C90 | C |
| 716 | 717 | 1 | 1 | Endres, Miss. Caroline Louise | female | 38.0 | 0 | 0 | PC 17757 | 227.5250 | C45 | C |
| 737 | 738 | 1 | 1 | Lesurer, Mr. Gustave J | male | 35.0 | 0 | 0 | PC 17755 | 512.3292 | B101 | C |
| 742 | 743 | 1 | 1 | Ryerson, Miss. Susan Parker "Suzette" | female | 21.0 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C |
| 789 | 790 | 0 | 1 | Guggenheim, Mr. Benjamin | male | 46.0 | 0 | 0 | PC 17593 | 79.2000 | B82 B84 | C |
| 835 | 836 | 1 | 1 | Compton, Miss. Sara Rebecca | female | 39.0 | 1 | 1 | PC 17756 | 83.1583 | E49 | C |
| 879 | 880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
Comments: not all passengers from Cherbourg are Gallic (🇫🇷: gaulois) …
What is the survival rate for raw data?
df_titanic_raw['Survived'].mean()
0.3838383838383838
What is the survival rate for data after removing missing values?
df_titanic['Survived'].mean()
0.6721311475409836
See also the command:
df_titanic.groupby(['Sex'])[['Survived', 'Age', 'Fare']].mean()
|   | Survived | Age | Fare |
|---|---|---|---|
| Sex | |||
| female | 0.931818 | 32.676136 | 89.000900 |
| male | 0.431579 | 38.451789 | 69.124343 |
Conclusion: Be careful when you remove some missing values, the missingness might be informative!
Data import/export
The Pandas library supports many formats:
- CSV, text
- SQL database
- Excel
- HDF5
- JSON
- HTML
- pickle
- sas, stata
- …
Exploration
pd.options.display.max_rows = 8
df_titanic_raw.tail()
|   | Name | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | Montvila, Rev. Juozas | 887 | 0 | 2 | male | 27.0 | 0 | 0 | 211536 | 13.00 | NaN | S |
| 887 | Graham, Miss. Margaret Edith | 888 | 1 | 1 | female | 19.0 | 0 | 0 | 112053 | 30.00 | B42 | S |
| 888 | Johnston, Miss. Catherine Helen "Carrie" | 889 | 0 | 3 | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | Behr, Mr. Karl Howell | 890 | 1 | 1 | male | 26.0 | 0 | 0 | 111369 | 30.00 | C148 | C |
| 890 | Dooley, Mr. Patrick | 891 | 0 | 3 | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |
df_titanic_raw.head()
|   | Name | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Braund, Mr. Owen Harris | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | Heikkinen, Miss. Laina | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | Allen, Mr. William Henry | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Access values by line/columns etc.
iloc
df_titanic_raw.iloc[0:2, 1:8]
|   | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 |
loc
# with original index:
# df_titanic_raw.loc[128]
# with naming indexing
df_titanic_raw = df_titanic_raw.set_index('Name') # You can only do it once !!
df_titanic_raw.loc['Bonnell, Miss. Elizabeth', 'Fare']
26.55
df_titanic_raw.loc['Bonnell, Miss. Elizabeth']
PassengerId 12
Survived 1
Pclass 1
Sex female
...
Ticket 113783
Fare 26.55
Cabin C103
Embarked S
Name: Bonnell, Miss. Elizabeth, Length: 11, dtype: object
df_titanic_raw.loc['Bonnell, Miss. Elizabeth', 'Survived']
df_titanic_raw.loc['Bonnell, Miss. Elizabeth', 'Survived'] = 0
df_titanic_raw.loc['Bonnell, Miss. Elizabeth']
PassengerId 12
Survived 0
Pclass 1
Sex female
...
Ticket 113783
Fare 26.55
Cabin C103
Embarked S
Name: Bonnell, Miss. Elizabeth, Length: 11, dtype: object
# set back the original value
df_titanic_raw.loc['Bonnell, Miss. Elizabeth', 'Survived'] = 1
df_titanic_raw = df_titanic_raw.reset_index()  # Back to original indexing
groupby
df_titanic.groupby(['Sex'])[['Survived', 'Age', 'Fare', 'Pclass']].mean()
|   | Survived | Age | Fare | Pclass |
|---|---|---|---|---|
| Sex | ||||
| female | 0.931818 | 32.676136 | 89.000900 | 1.215909 |
| male | 0.431579 | 38.451789 | 69.124343 | 1.168421 |
Create binned values
bins = np.arange(0, 100, 10)
current_palette = sns.color_palette()
df_test = pd.DataFrame({'Age': pd.cut(df_titanic['Age'], bins=bins, right=False)})
ax = sns.countplot(data=df_test, x='Age', color=current_palette[0])
ax.tick_params(axis='x', labelrotation=90)
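pd.cut returns a Categorical of intervals; with right=False the bins are half-open of the form [lo, hi). A sketch on a few made-up ages:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, not taken from the dataset
ages = pd.Series([4.0, 22.0, 38.0, 56.0])
bins = np.arange(0, 100, 10)

# right=False: intervals [lo, hi), so an age of exactly 10.0 falls in [10, 20)
binned = pd.cut(ages, bins=bins, right=False)
counts = binned.value_counts(sort=False)
print(int(counts.sum()))  # 4: every age falls in some bin
```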
Second case study: air quality in Paris (Source: Airparif)
url = "http://josephsalmon.eu/enseignement/datasets/20080421_20160927-PA13_auto.csv"
path_target = "./20080421_20160927-PA13_auto.csv"
path, fname = os.path.split(path_target)
pooch.retrieve(url, path=path, fname=fname, known_hash=None)
'/home/bcharlier/enseignement/HAX712/HAX712X/Courses/Pandas/20080421_20160927-PA13_auto.csv'
You can run, for instance, in a terminal:
!head -26 ./20080421_20160927-PA13_auto.csv
Alternatively:
from IPython import get_ipython
get_ipython().system('head -26 ./20080421_20160927-PA13_auto.csv')
References:
- Working with time series, Python Data Science Handbook by Jake VanderPlas
polution_df = pd.read_csv('20080421_20160927-PA13_auto.csv', sep=';',
                          comment='#',
                          na_values="n/d",
                          converters={'heure': str})
pd.options.display.max_rows = 30
polution_df.head(25)
|   | date | heure | NO2 | O3 |
|---|---|---|---|---|
| 0 | 21/04/2008 | 1 | 13.0 | 74.0 |
| 1 | 21/04/2008 | 2 | 11.0 | 73.0 |
| 2 | 21/04/2008 | 3 | 13.0 | 64.0 |
| 3 | 21/04/2008 | 4 | 23.0 | 46.0 |
| 4 | 21/04/2008 | 5 | 47.0 | 24.0 |
| 5 | 21/04/2008 | 6 | 70.0 | 11.0 |
| 6 | 21/04/2008 | 7 | 70.0 | 17.0 |
| 7 | 21/04/2008 | 8 | 76.0 | 16.0 |
| 8 | 21/04/2008 | 9 | NaN | NaN |
| 9 | 21/04/2008 | 10 | NaN | NaN |
| 10 | 21/04/2008 | 11 | NaN | NaN |
| 11 | 21/04/2008 | 12 | 33.0 | 60.0 |
| 12 | 21/04/2008 | 13 | 31.0 | 61.0 |
| 13 | 21/04/2008 | 14 | 37.0 | 61.0 |
| 14 | 21/04/2008 | 15 | 20.0 | 78.0 |
| 15 | 21/04/2008 | 16 | 29.0 | 71.0 |
| 16 | 21/04/2008 | 17 | 30.0 | 70.0 |
| 17 | 21/04/2008 | 18 | 38.0 | 58.0 |
| 18 | 21/04/2008 | 19 | 52.0 | 40.0 |
| 19 | 21/04/2008 | 20 | 56.0 | 29.0 |
| 20 | 21/04/2008 | 21 | 39.0 | 40.0 |
| 21 | 21/04/2008 | 22 | 31.0 | 42.0 |
| 22 | 21/04/2008 | 23 | 29.0 | 42.0 |
| 23 | 21/04/2008 | 24 | 28.0 | 36.0 |
| 24 | 22/04/2008 | 1 | 46.0 | 16.0 |
Data preprocessing
# check types
polution_df.dtypes
# check all
polution_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73920 entries, 0 to 73919
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 73920 non-null object
1 heure 73920 non-null object
2 NO2 71008 non-null float64
3 O3 71452 non-null float64
dtypes: float64(2), object(2)
memory usage: 2.3+ MB
For more information on the nature of Pandas objects, see this discussion on Stack Overflow. Moreover, the Pandas backend is slowly moving from NumPy to PyArrow; cf. the Pandas user guide.
Issues with non-conventional hours/day format
Start by changing to integer type (e.g., int8):
polution_df['heure'] = polution_df['heure'].astype(np.int8)
polution_df['heure']0 1
1 2
2 3
3 4
4 5
..
73915 20
73916 21
73917 22
73918 23
73919 24
Name: heure, Length: 73920, dtype: int8
Note: the hours run from 1 to 24, which is not conventional, so let's shift them to run from 0 to 23:
polution_df['heure'] = polution_df['heure'] - 1
polution_df['heure']
0 0
1 1
2 2
3 3
4 4
..
73915 19
73916 20
73917 21
73918 22
73919 23
Name: heure, Length: 73920, dtype: int8
and back to strings:
polution_df['heure'] = polution_df['heure'].astype('str')
polution_df['heure']
0 0
1 1
2 2
3 3
4 4
..
73915 19
73916 20
73917 21
73918 22
73919 23
Name: heure, Length: 73920, dtype: object
Time processing
Note that we have used the following format directives:
- %d: day
- %m: month
- %Y: year
- %H: hour
- %M: minutes
time_improved = pd.to_datetime(polution_df['date'] +
' ' + polution_df['heure'] + ':00',
format='%d/%m/%Y %H:%M')
time_improved
0 2008-04-21 00:00:00
1 2008-04-21 01:00:00
2 2008-04-21 02:00:00
3 2008-04-21 03:00:00
4 2008-04-21 04:00:00
...
73915 2016-09-27 19:00:00
73916 2016-09-27 20:00:00
73917 2016-09-27 21:00:00
73918 2016-09-27 22:00:00
73919 2016-09-27 23:00:00
Length: 73920, dtype: datetime64[ns]
polution_df['date'] + ' ' + polution_df['heure'] + ':00'
0 21/04/2008 0:00
1 21/04/2008 1:00
2 21/04/2008 2:00
3 21/04/2008 3:00
4 21/04/2008 4:00
...
73915 27/09/2016 19:00
73916 27/09/2016 20:00
73917 27/09/2016 21:00
73918 27/09/2016 22:00
73919 27/09/2016 23:00
Length: 73920, dtype: object
Store the proper datetime format in the dataframe:
polution_df['DateTime'] = time_improved
# remove useless columns:
del polution_df['heure']
del polution_df['date']
polution_df
| NO2 | O3 | DateTime | |
|---|---|---|---|
| 0 | 13.0 | 74.0 | 2008-04-21 00:00:00 |
| 1 | 11.0 | 73.0 | 2008-04-21 01:00:00 |
| 2 | 13.0 | 64.0 | 2008-04-21 02:00:00 |
| 3 | 23.0 | 46.0 | 2008-04-21 03:00:00 |
| 4 | 47.0 | 24.0 | 2008-04-21 04:00:00 |
| ... | ... | ... | ... |
| 73915 | 55.0 | 31.0 | 2016-09-27 19:00:00 |
| 73916 | 85.0 | 5.0 | 2016-09-27 20:00:00 |
| 73917 | 75.0 | 9.0 | 2016-09-27 21:00:00 |
| 73918 | 64.0 | 15.0 | 2016-09-27 22:00:00 |
| 73919 | 57.0 | 14.0 | 2016-09-27 23:00:00 |
73920 rows × 3 columns
Visualize the data set now that the time is well formatted:
polution_ts = polution_df.set_index(['DateTime'])
polution_ts = polution_ts.sort_index(ascending=True)
polution_ts.head(12)
| NO2 | O3 | |
|---|---|---|
| DateTime | ||
| 2008-04-21 00:00:00 | 13.0 | 74.0 |
| 2008-04-21 01:00:00 | 11.0 | 73.0 |
| 2008-04-21 02:00:00 | 13.0 | 64.0 |
| 2008-04-21 03:00:00 | 23.0 | 46.0 |
| 2008-04-21 04:00:00 | 47.0 | 24.0 |
| 2008-04-21 05:00:00 | 70.0 | 11.0 |
| 2008-04-21 06:00:00 | 70.0 | 17.0 |
| 2008-04-21 07:00:00 | 76.0 | 16.0 |
| 2008-04-21 08:00:00 | NaN | NaN |
| 2008-04-21 09:00:00 | NaN | NaN |
| 2008-04-21 10:00:00 | NaN | NaN |
| 2008-04-21 11:00:00 | 33.0 | 60.0 |
polution_ts.describe()
| NO2 | O3 | |
|---|---|---|
| count | 71008.000000 | 71452.000000 |
| mean | 34.453414 | 39.610046 |
| std | 20.380702 | 28.837333 |
| min | 1.000000 | 0.000000 |
| 25% | 19.000000 | 16.000000 |
| 50% | 30.000000 | 38.000000 |
| 75% | 46.000000 | 58.000000 |
| max | 167.000000 | 211.000000 |
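One benefit of a sorted `DatetimeIndex` is partial-string indexing: a date string selects every row of that day, month, or year. A self-contained sketch on synthetic hourly data:

```python
import numpy as np
import pandas as pd

# two days of synthetic hourly measurements
rng = pd.date_range("2008-04-21", periods=48, freq="h")
ts = pd.Series(np.arange(48), index=rng)

# a partial date string selects all 24 hours of that calendar day
day_one = ts.loc["2008-04-21"]
```

The same idea works at coarser granularity, e.g. `ts.loc["2008-04"]` for a whole month.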
fig, axes = plt.subplots(2, 1, figsize=(6, 6), sharex=True)
axes[0].plot(polution_ts['O3'])
axes[0].set_title("Ozone (O3) pollution in Paris")
axes[0].set_ylabel("Concentration (µg/m³)")
axes[1].plot(polution_ts['NO2'])
axes[1].set_title("Nitrogen dioxide (NO2) pollution in Paris")
axes[1].set_ylabel("Concentration (µg/m³)")
plt.show()
fig, axes = plt.subplots(2, 1, figsize=(10, 5), sharex=True)
axes[0].plot(polution_ts['O3'].resample('d').max(), '--')
axes[0].plot(polution_ts['O3'].resample('d').min(), '-.')
axes[0].set_title("Ozone (O3) pollution: daily max/min in Paris")
axes[0].set_ylabel("Concentration (µg/m³)")
axes[1].plot(polution_ts['NO2'].resample('d').max(), '--')
axes[1].plot(polution_ts['NO2'].resample('d').min(), '-.')
axes[1].set_title("Nitrogen dioxide (NO2) pollution: daily max/min in Paris")
axes[1].set_ylabel("Concentration (µg/m³)")
plt.show()
Source: https://www.tutorialspoint.com/python/time_strptime.htm
fig, ax = plt.subplots(1, 1)
polution_ts['2008':].resample('Y').mean().plot(ax=ax)
# resample by year: 'Y' for Year ('A' for Annual is a legacy alias)
plt.ylim(0, 50)
plt.title("Pollution evolution: \n yearly average in Paris")
plt.ylabel("Concentration (µg/m³)")
plt.xlabel("Year")
plt.show()
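`resample` groups timestamps into regular periods (here years) and applies an aggregation to each period. A minimal sketch on synthetic data:

```python
import numpy as np
import pandas as pd

# two days of synthetic hourly values: 0, 1, ..., 47
rng = pd.date_range("2020-01-01", periods=48, freq="h")
s = pd.Series(np.arange(48.0), index=rng)

# aggregate to one value per calendar day
daily_mean = s.resample("D").mean()
```

Swapping `"D"` for `"Y"` (or `mean` for `max`/`min`) gives exactly the plots built above.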
Loading colors:
sns.set_palette("GnBu_d", n_colors=7)
polution_ts['weekday'] = polution_ts.index.weekday # Monday=0, Sunday=6
polution_ts['weekend'] = polution_ts['weekday'].isin([5, 6])
polution_week_no2 = polution_ts.groupby(['weekday', polution_ts.index.hour])[
'NO2'].mean().unstack(level=0)
polution_week_03 = polution_ts.groupby(['weekday', polution_ts.index.hour])[
'O3'].mean().unstack(level=0)
plt.show()
fig, axes = plt.subplots(2, 1, figsize=(7, 7), sharex=True)
polution_week_no2.plot(ax=axes[0])
axes[0].set_ylabel("Concentration (µg/m³)")
axes[0].set_xlabel("Intraday evolution")
axes[0].set_title(
"Daily NO2 concentration: weekend effect?")
axes[0].set_xticks(np.arange(0, 24))
axes[0].set_xticklabels(np.arange(0, 24), rotation=45)
axes[0].set_ylim(0, 60)
polution_week_03.plot(ax=axes[1])
axes[1].set_ylabel("Concentration (µg/m³)")
axes[1].set_xlabel("Intraday evolution")
axes[1].set_title("Daily O3 concentration: weekend effect?")
axes[1].set_xticks(np.arange(0, 24))
axes[1].set_xticklabels(np.arange(0, 24), rotation=45)
axes[1].set_ylim(0, 70)
axes[0].legend().set_visible(False)
# ax.legend()
axes[1].legend(labels=[day for day in calendar.day_name], loc='lower left', bbox_to_anchor=(1, 0.1))
plt.tight_layout()
plt.show()
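The weekday/hour curves above come from grouping on two keys and then unstacking one level into columns. A minimal sketch of the pattern on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    "weekday": [0, 0, 1, 1],
    "hour": [0, 1, 0, 1],
    "value": [10.0, 20.0, 30.0, 40.0],
})
# mean per (weekday, hour) pair, then move 'weekday' (level 0) to the columns
table = df.groupby(["weekday", "hour"])["value"].mean().unstack(level=0)
```

The result has one row per hour and one column per weekday, which is exactly the shape `DataFrame.plot` needs to draw one curve per weekday.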
polution_ts['month'] = polution_ts.index.month  # January=1, ..., December=12
polution_ts['month'] = polution_ts['month'].apply(lambda x:
                                                  calendar.month_abbr[x])
polution_ts.head()
| NO2 | O3 | weekday | weekend | month | |
|---|---|---|---|---|---|
| DateTime | |||||
| 2008-04-21 00:00:00 | 13.0 | 74.0 | 0 | False | Apr |
| 2008-04-21 01:00:00 | 11.0 | 73.0 | 0 | False | Apr |
| 2008-04-21 02:00:00 | 13.0 | 64.0 | 0 | False | Apr |
| 2008-04-21 03:00:00 | 23.0 | 46.0 | 0 | False | Apr |
| 2008-04-21 04:00:00 | 47.0 | 24.0 | 0 | False | Apr |
polution_month_no2 = polution_ts.groupby(['month', polution_ts.index.hour])[
'NO2'].mean().unstack(level=0)
polution_month_03 = polution_ts.groupby(['month', polution_ts.index.hour])[
'O3'].mean().unstack(level=0)
sns.set_palette("Paired", n_colors=12)
fig, axes = plt.subplots(2, 1, figsize=(7, 7), sharex=True)
polution_month_no2.plot(ax=axes[0])
axes[0].set_ylabel("Concentration (µg/m³)")
axes[0].set_xlabel("Hour of the day")
axes[0].set_title(
    "Daily NO2 profile per month: seasonal effect?")
axes[0].set_xticks(np.arange(0, 24))
axes[0].set_xticklabels(np.arange(0, 24), rotation=45)
axes[0].set_ylim(0, 90)
polution_month_03.plot(ax=axes[1])
axes[1].set_ylabel("Concentration (µg/m³)")
axes[1].set_xlabel("Hour of the day")
axes[1].set_title("Daily O3 profile per month: seasonal effect?")
axes[1].set_xticks(np.arange(0, 24))
axes[1].set_xticklabels(np.arange(0, 24), rotation=45)
axes[1].set_ylim(0, 90)
axes[0].legend().set_visible(False)
# ax.legend()
axes[1].legend(labels=calendar.month_name[1:], loc='lower left',
bbox_to_anchor=(1, 0.1))
plt.tight_layout()
plt.show()
Third case study: explore a dataset on bike accidents in France
url = "https://koumoul.com/s/data-fair/api/v1/datasets/accidents-velos/raw"
path_target = "./bicycle_db.csv"
path, fname = os.path.split(path_target)
pooch.retrieve(url, path=path, fname=fname, known_hash=None)
'/home/bcharlier/enseignement/HAX712/HAX712X/Courses/Pandas/bicycle_db.csv'
# df: data frame
df_bikes = pd.read_csv("bicycle_db.csv", na_values="", low_memory=False,
dtype={'date': str, 'heure': str, 'departement': str})
In June 2023, the dataset author changed the column names, so we define a dictionary to map them back to the legacy names:
new2old = {
"hrmn": "heure",
"secuexist": "existence securite",
"grav": "gravite accident",
"dep": "departement"
}
df_bikes.rename(columns=new2old, inplace=True)
get_ipython().system('head -5 ./bicycle_db.csv')
pd.options.display.max_columns = 40
df_bikes.head()| Num_Acc | date | an | mois | jour | heure | departement | com | lat | long | agg | int | col | lum | atm | catr | circ | nbv | prof | plan | lartpc | larrout | surf | infra | situ | gravite accident | sexe | age | trajet | existence securite | equipement | obs | obsm | choc | manv | vehiculeid | typevehicules | manoeuvehicules | numVehicules | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200500000030 | 2005-01-13 | 2005 | janvier | jeudi | 19:45 | 62 | 62331 | 50.3 | 2.84 | 2 | 1 | 3.0 | 5 | 1.0 | 3 | 2.0 | 0.0 | -1.0 | 1.0 | 000 | 050 | 1.0 | 0.0 | 1.0 | 4 | 1 | 58.0 | 5.0 | 0 | 0 | 0.0 | 2.0 | 8.0 | 11.0 | 200500000030B02 | 18 | 17 | 1.0 |
| 1 | 200500000034 | 2005-01-19 | 2005 | janvier | mercredi | 10:45 | 62 | 62022 | 0 | 0 | 1 | 1 | 1.0 | 1 | 7.0 | 3 | 2.0 | 0.0 | 1.0 | 3.0 | 000 | 050 | 1.0 | 0.0 | 1.0 | 3 | 1 | 20.0 | 5.0 | 0 | 0 | 0.0 | 2.0 | 1.0 | 1.0 | 200500000034B02 | 10 | 15 | 1.0 |
| 2 | 200500000078 | 2005-01-26 | 2005 | janvier | mercredi | 13:15 | 02 | 02173 | 0 | 0 | 1 | 9 | 3.0 | 1 | 1.0 | 3 | 2.0 | 2.0 | 2.0 | 1.0 | 000 | 000 | 1.0 | 0.0 | 1.0 | 4 | 1 | 71.0 | 5.0 | 1 | 2 | 0.0 | 2.0 | 1.0 | 1.0 | 200500000078B02 | 7 | 15 | 1.0 |
| 3 | 200500000093 | 2005-01-03 | 2005 | janvier | lundi | 13:30 | 02 | 02810 | 49.255 | 3.094 | 2 | 1 | 1.0 | 1 | 1.0 | 3 | 2.0 | 0.0 | 1.0 | 2.0 | 000 | 052 | 1.0 | 0.0 | 1.0 | 3 | 2 | 51.0 | 4.0 | 0 | 0 | 0.0 | 2.0 | 3.0 | 21.0 | 200500000093B02 | 7 | 21 | 1.0 |
| 4 | 200500000170 | 2005-01-29 | 2005 | janvier | samedi | 18:30 | 76 | 76196 | 0 | 0 | 1 | 1 | 2.0 | 3 | 1.0 | 3 | 2.0 | 2.0 | 1.0 | 1.0 | 000 | 050 | 1.0 | 0.0 | 1.0 | 4 | 1 | 74.0 | 5.0 | 1 | 9 | 0.0 | 2.0 | 4.0 | 2.0 | 200500000170A01 | 10 | 2 | 1.0 |
df_bikes['existence securite'].unique()
array([0, 1, 2])
df_bikes['gravite accident'].unique()
array([4, 3, 2, 1])
Handle missing values in heure
df_bikes['date'].hasnans
df_bikes['heure'].hasnans
False
pd.options.display.max_rows = 20
df_bikes.iloc[400:402]
| Num_Acc | date | an | mois | jour | heure | departement | com | lat | long | agg | int | col | lum | atm | catr | circ | nbv | prof | plan | lartpc | larrout | surf | infra | situ | gravite accident | sexe | age | trajet | existence securite | equipement | obs | obsm | choc | manv | vehiculeid | typevehicules | manoeuvehicules | numVehicules | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 400 | 200500008935 | 2005-02-13 | 2005 | février | dimanche | 37: | 75 | 75018 | 0 | 0 | 2 | 2 | 3.0 | 5 | 6.0 | 4 | 2.0 | 4.0 | 1.0 | 1.0 | 000 | 120 | 1.0 | 0.0 | 1.0 | 4 | 1 | 33.0 | 1.0 | 1 | 2 | 0.0 | 2.0 | 3.0 | 7.0 | 200500008935B01 | 7 | 16 | 1.0 |
| 401 | 200500008941 | 2005-02-14 | 2005 | février | lundi | 15:05 | 75 | 75007 | 0 | 0 | 2 | 1 | 3.0 | 1 | 1.0 | 4 | 2.0 | 4.0 | 1.0 | 1.0 | 000 | 120 | 1.0 | 0.0 | 1.0 | 4 | 2 | 22.0 | 5.0 | 1 | 2 | 0.0 | 2.0 | 2.0 | 1.0 | 200500008941A01 | 7 | 9 | 1.0 |
Replace the missing hours (empty strings) with np.nan:
df_bikes['heure'] = df_bikes['heure'].replace('', np.nan)
df_bikes.iloc[400:402]
| Num_Acc | date | an | mois | jour | heure | departement | com | lat | long | agg | int | col | lum | atm | catr | circ | nbv | prof | plan | lartpc | larrout | surf | infra | situ | gravite accident | sexe | age | trajet | existence securite | equipement | obs | obsm | choc | manv | vehiculeid | typevehicules | manoeuvehicules | numVehicules | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 400 | 200500008935 | 2005-02-13 | 2005 | février | dimanche | 37: | 75 | 75018 | 0 | 0 | 2 | 2 | 3.0 | 5 | 6.0 | 4 | 2.0 | 4.0 | 1.0 | 1.0 | 000 | 120 | 1.0 | 0.0 | 1.0 | 4 | 1 | 33.0 | 1.0 | 1 | 2 | 0.0 | 2.0 | 3.0 | 7.0 | 200500008935B01 | 7 | 16 | 1.0 |
| 401 | 200500008941 | 2005-02-14 | 2005 | février | lundi | 15:05 | 75 | 75007 | 0 | 0 | 2 | 1 | 3.0 | 1 | 1.0 | 4 | 2.0 | 4.0 | 1.0 | 1.0 | 000 | 120 | 1.0 | 0.0 | 1.0 | 4 | 2 | 22.0 | 5.0 | 1 | 2 | 0.0 | 2.0 | 2.0 | 1.0 | 200500008941A01 | 7 | 9 | 1.0 |
df_bikes.dropna(subset=['heure'], inplace=True)
df_bikes.iloc[399:402]
| Num_Acc | date | an | mois | jour | heure | departement | com | lat | long | agg | int | col | lum | atm | catr | circ | nbv | prof | plan | lartpc | larrout | surf | infra | situ | gravite accident | sexe | age | trajet | existence securite | equipement | obs | obsm | choc | manv | vehiculeid | typevehicules | manoeuvehicules | numVehicules | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 399 | 200500008875 | 2005-02-10 | 2005 | février | jeudi | 15:00 | 75 | 75016 | 0 | 0 | 2 | 1 | 2.0 | 1 | 1.0 | 4 | 2.0 | 4.0 | 1.0 | 1.0 | 000 | 120 | 1.0 | 0.0 | 1.0 | 4 | 1 | 54.0 | 5.0 | 0 | 0 | 0.0 | 2.0 | 4.0 | 1.0 | 200500008875B01 | 1 | 1 | 1.0 |
| 400 | 200500008935 | 2005-02-13 | 2005 | février | dimanche | 37: | 75 | 75018 | 0 | 0 | 2 | 2 | 3.0 | 5 | 6.0 | 4 | 2.0 | 4.0 | 1.0 | 1.0 | 000 | 120 | 1.0 | 0.0 | 1.0 | 4 | 1 | 33.0 | 1.0 | 1 | 2 | 0.0 | 2.0 | 3.0 | 7.0 | 200500008935B01 | 7 | 16 | 1.0 |
| 401 | 200500008941 | 2005-02-14 | 2005 | février | lundi | 15:05 | 75 | 75007 | 0 | 0 | 2 | 1 | 3.0 | 1 | 1.0 | 4 | 2.0 | 4.0 | 1.0 | 1.0 | 000 | 120 | 1.0 | 0.0 | 1.0 | 4 | 2 | 22.0 | 5.0 | 1 | 2 | 0.0 | 2.0 | 2.0 | 1.0 | 200500008941A01 | 7 | 9 | 1.0 |
::: {.callout-important appearance='default' icon="false"}
## EXERCISE: start/end of the study
Can you find the starting day and the ending day of the study automatically?
Hint: sort the data by time, for instance with df.sort_values(by='Time').
df_bikes['date'] + ' ' + df_bikes['heure']
0 2005-01-13 19:45
1 2005-01-19 10:45
2 2005-01-26 13:15
3 2005-01-03 13:30
4 2005-01-29 18:30
...
74753 2021-01-02 18:30
74754 2021-01-04 08:20
74755 2021-01-01 16:55
74756 2021-01-02 15:40
74757 2021-01-01 12:20
Length: 74758, dtype: object
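A possible approach for the exercise, sketched on a toy frame (deliberately not the full solution on `df_bikes`):

```python
import pandas as pd

# hypothetical miniature 'Time' column for illustration
toy = pd.DataFrame({"Time": pd.to_datetime(
    ["2005-01-13", "2021-01-02", "2010-06-01"])})
toy = toy.sort_values(by="Time")
start, end = toy["Time"].iloc[0], toy["Time"].iloc[-1]
# equivalently, without sorting: toy["Time"].min(), toy["Time"].max()
```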
# ADAPT OLD to create the df_bikes['Time']
time_improved = pd.to_datetime(df_bikes['date'] +
                               ' ' + df_bikes['heure'],
                               format='%Y-%m-%d %H:%M',
                               errors='coerce')
# where %d = day, %m = month, %Y = year, %H = hour, %M = minutes
df_bikes['Time'] = time_improved
# remove rows with NaT
df_bikes.dropna(subset=["Time"], inplace=True)
# set new index
df_bikes.set_index('Time', inplace=True)
# remove useless columns
df_bikes.drop(columns=['heure', 'date'], inplace=True)
df_bikes.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 61124 entries, 2005-01-13 19:45:00 to 2021-01-01 12:20:00
Data columns (total 37 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Num_Acc 61124 non-null int64
1 an 61124 non-null int64
2 mois 61124 non-null object
3 jour 61124 non-null object
4 departement 61124 non-null object
5 com 61124 non-null object
6 lat 61124 non-null object
7 long 60904 non-null object
8 agg 61124 non-null int64
9 int 61124 non-null int64
10 col 61121 non-null float64
11 lum 61124 non-null int64
12 atm 61121 non-null float64
13 catr 61124 non-null int64
14 circ 61002 non-null float64
15 nbv 60959 non-null float64
16 prof 60977 non-null float64
17 plan 60962 non-null float64
18 lartpc 50586 non-null object
19 larrout 56446 non-null object
20 surf 60980 non-null float64
21 infra 60689 non-null float64
22 situ 60717 non-null float64
23 gravite accident 61124 non-null int64
24 sexe 61124 non-null int64
25 age 61105 non-null float64
26 trajet 61121 non-null float64
27 existence securite 61124 non-null int64
28 equipement 57383 non-null object
29 obs 61102 non-null float64
30 obsm 61089 non-null float64
31 choc 61113 non-null float64
32 manv 61110 non-null float64
33 vehiculeid 61124 non-null object
34 typevehicules 52392 non-null object
35 manoeuvehicules 52377 non-null object
36 numVehicules 52392 non-null float64
dtypes: float64(16), int64(9), object(12)
memory usage: 17.7+ MB
df_bike2 = df_bikes.loc[
:, ["gravite accident", "existence securite", "age", "sexe"]
]
df_bike2["existence securite"] = df_bike2["existence securite"].replace({"Inconnu": np.nan})
df_bike2.dropna(inplace=True)
| age |
|---|---|---|---|---|---|
| existence securite | 0 | 1 | 2 | All | |
| gravite accident | sexe | ||||
| 1 | -1 | NaN | 1.0 | NaN | 1 |
| 1 | 203.0 | 3197.0 | 229.0 | 3629 | |
| 2 | 48.0 | 497.0 | 37.0 | 582 | |
| 2 | 1 | 190.0 | 1555.0 | 177.0 | 1922 |
| 2 | 60.0 | 254.0 | 28.0 | 342 | |
| 3 | 1 | 1383.0 | 13888.0 | 902.0 | 16173 |
| 2 | 429.0 | 3331.0 | 218.0 | 3978 | |
| 4 | 1 | 1063.0 | 22313.0 | 1561.0 | 24937 |
| 2 | 515.0 | 8437.0 | 589.0 | 9541 | |
| All | 3891.0 | 53473.0 | 3741.0 | 61105 | |
| gravite accident | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| existence securite | ||||
| 0 | 6.450784 | 6.425084 | 46.569005 | 40.555127 |
| 1 | 6.910029 | 3.383016 | 32.201298 | 57.505657 |
| 2 | 7.110398 | 5.479818 | 29.938519 | 57.471264 |
Series([], dtype: float64)
sexe
-1 0.000016
1 0.763563
2 0.236421
dtype: float64
| gravite accident | 1 | 2 | 3 | 4 | All |
|---|---|---|---|---|---|
| sexe | |||||
| -1 | 0.023742 | 0.000000 | 0.000000 | 0.000000 | 0.001637 |
| 1 | 86.158594 | 84.893993 | 80.259044 | 72.327281 | 76.362000 |
| 2 | 13.817664 | 15.106007 | 19.740956 | 27.672719 | 23.636364 |
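The percentage tables above can presumably be obtained with `pd.crosstab` and its `normalize` option (the generating cell is not shown here). A minimal sketch on a toy frame, with the column names assumed from `df_bike2`:

```python
import pandas as pd

toy = pd.DataFrame({
    "gravite accident": [1, 1, 2, 2],
    "existence securite": [0, 1, 0, 1],
})
# row-wise percentages: each row of the result sums to 100
pct = pd.crosstab(toy["existence securite"], toy["gravite accident"],
                  normalize="index") * 100
```

`normalize="columns"` or `normalize="all"` give column-wise or grand-total percentages instead.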
To conclude
Note that the dataset contains no information on the level of cycling practice by gender, so the raw accident counts cannot be normalized by exposure.
df_bikes| Num_Acc | an | mois | jour | departement | com | lat | long | agg | int | col | lum | atm | catr | circ | nbv | prof | plan | lartpc | larrout | surf | infra | situ | gravite accident | sexe | age | trajet | existence securite | equipement | obs | obsm | choc | manv | vehiculeid | typevehicules | manoeuvehicules | numVehicules | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Time | |||||||||||||||||||||||||||||||||||||
| 2005-01-13 19:45:00 | 200500000030 | 2005 | janvier | jeudi | 62 | 62331 | 50.3 | 2.84 | 2 | 1 | 3.0 | 5 | 1.0 | 3 | 2.0 | 0.0 | -1.0 | 1.0 | 000 | 050 | 1.0 | 0.0 | 1.0 | 4 | 1 | 58.0 | 5.0 | 0 | 0 | 0.0 | 2.0 | 8.0 | 11.0 | 200500000030B02 | 18 | 17 | 1.0 |
| 2005-01-19 10:45:00 | 200500000034 | 2005 | janvier | mercredi | 62 | 62022 | 0 | 0 | 1 | 1 | 1.0 | 1 | 7.0 | 3 | 2.0 | 0.0 | 1.0 | 3.0 | 000 | 050 | 1.0 | 0.0 | 1.0 | 3 | 1 | 20.0 | 5.0 | 0 | 0 | 0.0 | 2.0 | 1.0 | 1.0 | 200500000034B02 | 10 | 15 | 1.0 |
| 2005-01-26 13:15:00 | 200500000078 | 2005 | janvier | mercredi | 02 | 02173 | 0 | 0 | 1 | 9 | 3.0 | 1 | 1.0 | 3 | 2.0 | 2.0 | 2.0 | 1.0 | 000 | 000 | 1.0 | 0.0 | 1.0 | 4 | 1 | 71.0 | 5.0 | 1 | 2 | 0.0 | 2.0 | 1.0 | 1.0 | 200500000078B02 | 7 | 15 | 1.0 |
| 2005-01-03 13:30:00 | 200500000093 | 2005 | janvier | lundi | 02 | 02810 | 49.255 | 3.094 | 2 | 1 | 1.0 | 1 | 1.0 | 3 | 2.0 | 0.0 | 1.0 | 2.0 | 000 | 052 | 1.0 | 0.0 | 1.0 | 3 | 2 | 51.0 | 4.0 | 0 | 0 | 0.0 | 2.0 | 3.0 | 21.0 | 200500000093B02 | 7 | 21 | 1.0 |
| 2005-01-29 18:30:00 | 200500000170 | 2005 | janvier | samedi | 76 | 76196 | 0 | 0 | 1 | 1 | 2.0 | 3 | 1.0 | 3 | 2.0 | 2.0 | 1.0 | 1.0 | 000 | 050 | 1.0 | 0.0 | 1.0 | 4 | 1 | 74.0 | 5.0 | 1 | 9 | 0.0 | 2.0 | 4.0 | 2.0 | 200500000170A01 | 10 | 2 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-01-02 18:30:00 | 202100056317 | 2021 | janvier | samedi | 44 | 44168 | 47,3777890000 | -2,1976410000 | 2 | 1 | 3.0 | 5 | 1.0 | 3 | 2.0 | 2.0 | 1.0 | 2.0 | NaN | -1 | 1.0 | 0.0 | 1.0 | 4 | 1 | 35.0 | 5.0 | 2 | NaN | 0.0 | 0.0 | 8.0 | 1.0 | 202100056317B01 | 7 | 14 | 1.0 |
| 2021-01-04 08:20:00 | 202100056362 | 2021 | janvier | lundi | 64 | 64138 | 43,2309460000 | -0,2765840000 | 2 | 3 | 1.0 | 1 | 1.0 | 3 | 2.0 | 2.0 | 1.0 | 1.0 | NaN | -1 | 1.0 | 0.0 | 1.0 | 4 | 1 | 53.0 | 1.0 | 1 | 2 | 0.0 | 2.0 | 0.0 | 1.0 | 202100056362B01 | 7 | 15 | 1.0 |
| 2021-01-01 16:55:00 | 202100056404 | 2021 | janvier | vendredi | 54 | 54395 | 48,6849869839 | 6,1760189384 | 2 | 2 | 3.0 | 5 | 1.0 | 4 | 1.0 | 2.0 | 1.0 | 1.0 | NaN | -1 | 1.0 | -1.0 | 1.0 | 4 | 1 | 12.0 | 5.0 | 1 | 2 | 0.0 | 2.0 | 1.0 | 25.0 | 202100056404A01 | 7 | 1 | 1.0 |
| 2021-01-02 15:40:00 | 202100056424 | 2021 | janvier | samedi | 75 | 75110 | 48,8769050000 | 2,3665940000 | 2 | 1 | 3.0 | 1 | 8.0 | 4 | 1.0 | 1.0 | 2.0 | 1.0 | NaN | -1 | 1.0 | -1.0 | 5.0 | 4 | 2 | 43.0 | 5.0 | 2 | NaN | 0.0 | 2.0 | 1.0 | 1.0 | 202100056424A01 | 7 | 9 | 1.0 |
| 2021-01-01 12:20:00 | 202100056508 | 2021 | janvier | vendredi | 64 | 64400 | 43,1465730000 | -0,1955440000 | 1 | 1 | 7.0 | 1 | 1.0 | 4 | 2.0 | 2.0 | 2.0 | 2.0 | NaN | -1 | 2.0 | 0.0 | 1.0 | 3 | 1 | 46.0 | 5.0 | 1 | 2/6 | 0.0 | 0.0 | 8.0 | 1.0 | 202100056508A01 | NaN | NaN | NaN |
61124 rows × 37 columns
# Load the color palette
sns.set_palette("GnBu_d", n_colors=7)
df_bikes['weekday'] = df_bikes.index.day_of_week # Monday=0, Sunday=6
accidents_week = df_bikes.groupby(['weekday', df_bikes.index.hour])[
'sexe'].count().unstack(level=0)
fig, axes = plt.subplots(1, 1, figsize=(7, 7))
accidents_week.plot(ax=axes)
axes.set_ylabel("Accidents")
axes.set_xlabel("Hour of the day")
axes.set_title(
    "Daily accident profile: weekend effect?")
axes.set_xticks(np.arange(0, 24))
axes.set_xticklabels(np.arange(0, 24), rotation=45)
axes.legend(
labels=[day for day in calendar.day_name],
loc='upper left',
)
plt.tight_layout()
plt.show()
df_bikes.groupby(['weekday', df_bikes.index.hour])[
'sexe'].count()
weekday Time
0 0 4
1 4
3 3
4 3
5 6
...
6 19 385
20 260
21 159
22 92
23 80
Name: sexe, Length: 164, dtype: int64
df_bikes['month'] = df_bikes.index.month  # January=1, ..., December=12
df_bikes['month'] = df_bikes['month'].apply(lambda x: calendar.month_abbr[x])
df_bikes.head()
sns.set_palette("GnBu_d", n_colors=12) # sns.set_palette("colorblind",...)
df_bikes_month = df_bikes.groupby(['month', df_bikes.index.hour])[
'age'].count().unstack(level=0)
fig, axes = plt.subplots(1, 1, figsize=(7, 7), sharex=True)
df_bikes_month.plot(ax=axes)
axes.set_ylabel("Accidents")
axes.set_xlabel("Hour of the day")
axes.set_title(
    "Daily accident profile per month: seasonal effect?")
axes.set_xticks(np.arange(0, 24))
axes.set_xticklabels(np.arange(0, 24), rotation=45)
axes.legend(labels=calendar.month_name[1:], loc='upper left')
plt.tight_layout()
plt.show()
path_target = "./dpt_population.csv"
url = "https://public.opendatasoft.com/explore/dataset/population-francaise-par-departement-2018/download/?format=csv&timezone=Europe/Berlin&lang=en&use_labels_for_header=true&csv_separator=%3B"
path, fname = os.path.split(path_target)
pooch.retrieve(url, path=path, fname=fname, known_hash=None)
'/home/bcharlier/enseignement/HAX712/HAX712X/Courses/Pandas/dpt_population.csv'
df_dtp_pop = pd.read_csv("dpt_population.csv", sep=";", low_memory=False)
df_dtp_pop['Code Département'].replace('2A', '20A', inplace=True)
df_dtp_pop['Code Département'].replace('2B', '20B', inplace=True)
df_dtp_pop.sort_values(by=['Code Département'], inplace=True)
df_bikes['departement'].replace('2A', '20A', inplace=True)
df_bikes['departement'].replace('2B', '20B', inplace=True)
df_bikes.sort_values(by=['departement'], inplace=True)
# Keep only the departments present in the population table
df_bikes = df_bikes.loc[df_bikes['departement'].isin(df_dtp_pop['Code Département'].unique())]
gd = df_bikes.groupby(['departement'], as_index=True, sort=True).size()
data = {'code': gd.index,
'# Accidents per million': gd.values}
df = pd.DataFrame(data)
df['# Accidents per million'] = df['# Accidents per million'].values * 1.e6 / df_dtp_pop['Population'].values
path_target = "./departements-avec-outre-mer.geojson"
url = "https://raw.githubusercontent.com/gregoiredavid/france-geojson/master/departements-avec-outre-mer.geojson"
path, fname = os.path.split(path_target)
pooch.retrieve(url, path=path, fname=fname, known_hash=None)
'/home/bcharlier/enseignement/HAX712/HAX712X/Courses/Pandas/departements-avec-outre-mer.geojson'
import plotly.express as px
import geopandas
departement = geopandas.read_file('departements-avec-outre-mer.geojson')
departement['code'].replace('2A', '20A', inplace=True)
departement['code'].replace('2B', '20B', inplace=True)
departement.sort_values(by=['code'], inplace=True)
a = ['0'+ str(i) for i in range(1, 10)]
b = [str(i) for i in range(1, 10)]
dict_replace = dict(zip(a, b))
departement['code'].replace(dict_replace, inplace=True)
df['code'].replace(dict_replace, inplace=True)
departement['code'].replace('20A', '2A', inplace=True)
departement['code'].replace('20B', '2B', inplace=True)
df['code'].replace('20A', '2A', inplace=True)
df['code'].replace('20B', '2B', inplace=True)
departement.set_index('code', inplace=True)
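The leading-zero stripping above, done with an explicit replacement dictionary, can also be expressed with a single regular expression via `Series.str.replace`. A sketch on hypothetical codes:

```python
import pandas as pd

codes = pd.Series(["01", "09", "10", "2A"])
# drop a single leading zero from purely numeric two-character codes;
# '10' and the Corsican codes '2A'/'2B' are left untouched
normalized = codes.str.replace(r"^0(\d)$", r"\1", regex=True)
```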
fig = px.choropleth_mapbox(
df,
geojson=departement,
locations="code",
color="# Accidents per million",
range_color=(0, df['# Accidents per million'].max()),
color_continuous_scale="rdbu",
center={'lat': 47, 'lon': 2},
zoom=3.25,
mapbox_style="white-bg",
)
fig.update_traces(colorbar_orientation='h', selector=dict(type='choroplethmapbox'))
fig.update_layout(
title_text = 'Accidents per million inhabitants by department',
)
fig.layout.coloraxis.colorbar.thickness = 20
fig.layout.coloraxis.colorbar.orientation = 'h'
fig.layout.coloraxis.colorbar.y = -0.2
fig.layout.coloraxis.colorbar.x = 0.2
fig.show()
References:
Other interactive tools for data visualization: Altair, Bokeh. See comparisons by Aaron Geller: link
An interesting tutorial: Altair introduction
Choropleth Maps in practice with Plotly and Python by Thibaud Lamothe